Which chemical properties influence the quality of white wines? by Mae Linh Fatum

Background

I will be the first to admit I know nothing about wine. I am hoping that learning about the physicochemical properties of good wines will help me choose good wines to gift my friends instead of my usual method, ‘oooo, what a pretty label!’. When I do drink wine I tend to lean towards the sweeter whites, which is why I decided to explore the white wine data.

There are 4898 observations with 13 variables as detailed below.

The following variables are being analyzed:

* fixed acidity (tartaric acid - g / dm^3)
* volatile acidity (acetic acid - g / dm^3) - high levels give a vinegar-like quality.
* citric acid (g / dm^3) - adds ‘freshness’ and flavor to wines
* residual sugar (g / dm^3) - sweet wines have >45 g/dm^3
* chlorides (sodium chloride - g / dm^3) - amount of salt
* free sulfur dioxide (mg / dm^3) - dissolved SO2 gas. Prevents microbial growth and oxidation of wine. Becomes evident in nose and taste at >50 mg / dm^3.
* total sulfur dioxide (mg / dm^3) - contains both free and fixed forms.
* density (g / cm^3) - changes depending on alcohol and sugar content. Note: density of water = 1, alcohol <1, sugar >1
* pH - 0 (very acidic) to 14 (very basic). Most wines are 3-4
* sulphates (potassium sulphate - g / dm^3) - creates SO2. Antimicrobial and antioxidant
* alcohol (% by volume)

Output variable (based on sensory data):

* quality (score between 0 and 10) - 0 (very bad) to 10 (excellent)
* rating (category) - Categorical classification of the quality score. Poor wines scored between 0-3, average wines scored between 4-6, and excellent wines scored between 7-10.
    + Note: I added this column to the table in Python, because I am more familiar with Python for data wrangling efforts.
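For reference, here is a minimal Python sketch of how a rating column like this could be derived from the quality score. The function name `rating_for` and the sample scores are made up for illustration; only the three score bands come from the description above.

```python
def rating_for(quality: int) -> str:
    """Map a 0-10 quality score to one of the three rating bands."""
    if not 0 <= quality <= 10:
        raise ValueError(f"quality must be between 0 and 10, got {quality}")
    if quality <= 3:
        return "poor"       # scores 0-3
    if quality <= 6:
        return "average"    # scores 4-6
    return "excellent"      # scores 7-10


# Hypothetical scores, just to show the mapping.
scores = [3, 5, 6, 7, 9]
ratings = [rating_for(q) for q in scores]
print(ratings)
```

Keeping the bands in one small function means the same cut-offs are used everywhere the rating is needed.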

The background text explains that these observations are for white variants of the Portuguese “Vinho Verde” wine. Also, the quality scores are not evenly distributed (there are a lot more average wines than poor or excellent ones). The authors of the data set also weren’t sure that every variable is relevant, so I guess that is for me to figure out!

Univariate Plots Section

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality            rating    
##  Min.   :3.000   average  :3818  
##  1st Qu.:5.000   excellent:1060  
##  Median :6.000   poor     :  20  
##  Mean   :5.878                   
##  3rd Qu.:6.000                   
##  Max.   :9.000

Whoa, there is a massive outlier in residual sugar. The maximum is 65.8, while the 75th percentile is only 9.9. According to the documentation, ‘wines with greater than 45 grams/liter are considered sweet’, and in my limited wine knowledge I know that people have their preferences for dry or sweet wine. I’m going to break the data set up into dry wines (sugar < 45 g/dm^3) and sweet wines (sugar > 45 g/dm^3).
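The split itself is simple. Below is a rough Python sketch of it, not the code used to produce the summaries that follow. The field name `residual_sugar`, the helper `split_by_sugar`, and the four sample rows are assumptions for illustration (the sugar values are taken from the summary above, and the quality scores for the first three rows are invented). A wine at exactly 45 is treated as dry here, which is also an assumption since the text does not say.

```python
SWEET_THRESHOLD = 45.0  # g/dm^3, from the dataset documentation quoted above


def split_by_sugar(wines, threshold=SWEET_THRESHOLD):
    """Return (dry, sweet) lists; a wine exactly at the threshold counts as dry."""
    dry = [w for w in wines if w["residual_sugar"] <= threshold]
    sweet = [w for w in wines if w["residual_sugar"] > threshold]
    return dry, sweet


# Illustrative rows only, not the real data set.
wines = [
    {"residual_sugar": 1.7, "quality": 6},
    {"residual_sugar": 9.9, "quality": 5},
    {"residual_sugar": 31.6, "quality": 6},
    {"residual_sugar": 65.8, "quality": 6},
]
dry, sweet = split_by_sugar(wines)
print(len(dry), len(sweet))
```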

## [1] "dry wine summary"
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2781   Mean   :0.3341   Mean   : 6.379  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :31.600  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0103   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality            rating    
##  Min.   :3.000   average  :3817  
##  1st Qu.:5.000   excellent:1060  
##  Median :6.000   poor     :  20  
##  Mean   :5.878                   
##  3rd Qu.:6.000                   
##  Max.   :9.000
## [1] "sweet wine summary"
##  fixed.acidity volatile.acidity  citric.acid  residual.sugar
##  Min.   :7.8   Min.   :0.965    Min.   :0.6   Min.   :65.8  
##  1st Qu.:7.8   1st Qu.:0.965    1st Qu.:0.6   1st Qu.:65.8  
##  Median :7.8   Median :0.965    Median :0.6   Median :65.8  
##  Mean   :7.8   Mean   :0.965    Mean   :0.6   Mean   :65.8  
##  3rd Qu.:7.8   3rd Qu.:0.965    3rd Qu.:0.6   3rd Qu.:65.8  
##  Max.   :7.8   Max.   :0.965    Max.   :0.6   Max.   :65.8  
##    chlorides     free.sulfur.dioxide total.sulfur.dioxide    density     
##  Min.   :0.074   Min.   :8           Min.   :160          Min.   :1.039  
##  1st Qu.:0.074   1st Qu.:8           1st Qu.:160          1st Qu.:1.039  
##  Median :0.074   Median :8           Median :160          Median :1.039  
##  Mean   :0.074   Mean   :8           Mean   :160          Mean   :1.039  
##  3rd Qu.:0.074   3rd Qu.:8           3rd Qu.:160          3rd Qu.:1.039  
##  Max.   :0.074   Max.   :8           Max.   :160          Max.   :1.039  
##        pH         sulphates       alcohol        quality        rating 
##  Min.   :3.39   Min.   :0.69   Min.   :11.7   Min.   :6   average  :1  
##  1st Qu.:3.39   1st Qu.:0.69   1st Qu.:11.7   1st Qu.:6   excellent:0  
##  Median :3.39   Median :0.69   Median :11.7   Median :6   poor     :0  
##  Mean   :3.39   Mean   :0.69   Mean   :11.7   Mean   :6                
##  3rd Qu.:3.39   3rd Qu.:0.69   3rd Qu.:11.7   3rd Qu.:6                
##  Max.   :3.39   Max.   :0.69   Max.   :11.7   Max.   :6

Okay so there is only one sweet wine, so I can’t really do any analysis on that. I will move forward with only the dry wines.

Interesting points:

- Citric acid has a minimum value of 0.0, which after a little bit of research is okay. Not all wines have citric acid.
- Volatile acidity (VA), sugar, free sulfur dioxide (free SO2), density, and sulphates have outliers on the high end.
- Fixed acidity (FA), total SO2, citric acid (CA), and pH have outliers on both ends.

Let’s investigate these outliers and see if they all come from the same wine. If so, we can safely remove that wine from our analysis.

We can’t disregard the outliers, because no single wine has all of them. Our outlier set has 3940 observations, so almost every single wine has an outlying value of some sort. In short, we have lots of outliers in every category except alcohol and chlorides. So we will zoom into the data using coord_cartesian, which keeps every value in the data set while letting the plots focus on the majority of the data.
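The text does not say exactly how the outliers were flagged, so the following is only a sketch under one common assumption: the boxplot rule, where a value is an outlier if it lies more than 1.5 times the interquartile range beyond the quartiles. The helper names and the toy rows are hypothetical. It shows how one can check whether the outliers sit on a single wine or are spread across many.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed boxplot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def rows_with_any_outlier(rows, columns, k=1.5):
    """Return sorted indices of rows that are an outlier in at least one column."""
    flagged = set()
    for col in columns:
        bad = set(iqr_outliers([r[col] for r in rows], k))
        flagged.update(i for i, r in enumerate(rows) if r[col] in bad)
    return sorted(flagged)


# Toy data: the outlier in column "a" and the one in column "b" are on different rows.
a_vals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
b_vals = [50, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rows = [{"a": a, "b": b} for a, b in zip(a_vals, b_vals)]
flagged_rows = rows_with_any_outlier(rows, ["a", "b"])
print(flagged_rows)
```

When the flagged rows cover most of the table, as they do in the real data, dropping them would throw away most of the wines, which is why zooming the plots is the gentler option.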

Fixed acidity is normally distributed, so no transformation is needed. Let’s look at the rating breakdown.

All the ratings peak at around the same point, and they still maintain a generally normal distribution. FA does not look like a factor. Let’s run some t-tests to be sure.

## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$fixed.acidity and avg_wines$fixed.acidity
## t = 1.8485, df = 19.049, p-value = 0.08011
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.0942196  1.5209422
## sample estimates:
## mean of x mean of y 
##  7.600000  6.886639
## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$fixed.acidity and exc_wines$fixed.acidity
## t = 2.2642, df = 19.143, p-value = 0.03536
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.06655039 1.68316659
## sample estimates:
## mean of x mean of y 
##  7.600000  6.725142
## 
##  Welch Two Sample t-test
## 
## data:  avg_wines$fixed.acidity and exc_wines$fixed.acidity
## t = 5.9055, df = 1845.3, p-value = 4.174e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1078635 0.2151310
## sample estimates:
## mean of x mean of y 
##  6.886639  6.725142

Well, I retract my previous statement. The mean fixed acidity of excellent wines differs significantly from that of both average and poor wines, but there is no significant difference between the poor and average means.
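To make the test output above less of a black box, here is a small Python sketch of the two numbers a Welch two-sample t-test reports first: the t statistic and the Welch-Satterthwaite degrees of freedom. The function name `welch_t` and the sample lists are made up for illustration. The p-value needs the t distribution, which the Python standard library does not provide, so that step is left to a statistics package such as R’s t.test.

```python
import math
import statistics


def welch_t(x, y):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    nx, ny = len(x), len(y)
    vx, vy = statistics.variance(x), statistics.variance(y)  # sample variances
    se2 = vx / nx + vy / ny                                   # squared standard error
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df


# Hypothetical samples, not taken from the wine data.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(group_a, group_b)
print(round(t_stat, 4), round(dof, 4))
```

Because the degrees of freedom depend on both sample sizes and both variances, a comparison involving the 20 poor wines ends up with a df near 19, which matches what the output above shows.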

Volatile acidity is right skewed (its tail runs out toward the high values), so let’s see if a transformation makes it normal:

That did the trick, so in order to compare VA with the other data we need to first take the square root. I decided to use the square root instead of log 10 because the data didn’t have enough of a spread for a logarithmic scale. The square root did the job nicely!
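One way to put a number on ‘did the trick’ is to compare the skewness before and after the transformation. The sketch below is only an illustration: the `skewness` helper is my own, and the five input values are just the min, quartiles, and max of volatile acidity from the summary table, not the full distribution, so the result only hints at the direction of the change.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for a symmetric sample)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Five-number summary of volatile acidity, used here as stand-in data.
va = [0.08, 0.21, 0.26, 0.32, 1.10]
trans_va = [math.sqrt(v) for v in va]

skew_before = skewness(va)
skew_after = skewness(trans_va)
print(skew_before > skew_after)
```

A positive skewness means a long tail to the right, and the square root pulls that tail in, so the value moves toward zero.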

VA looks like a factor in determining wine quality: the three graphs have roughly the same shape, but they peak at slightly different values. Let’s run a t-test to see if the differences are significant.

## 
##  Welch Two Sample t-test
## 
## data:  avg_wines$trans_VA and poor_wines$trans_VA
## t = -1.6805, df = 19.12, p-value = 0.1091
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.09740153  0.01062695
## sample estimates:
## mean of x mean of y 
## 0.5228504 0.5662377
## 
##  Welch Two Sample t-test
## 
## data:  avg_wines$trans_VA and exc_wines$trans_VA
## t = 5.0034, df = 1696.3, p-value = 6.217e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.009415845 0.021557703
## sample estimates:
## mean of x mean of y 
## 0.5228504 0.5073637
## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$trans_VA and exc_wines$trans_VA
## t = 2.2712, df = 19.431, p-value = 0.03468
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.00469983 0.11304830
## sample estimates:
## mean of x mean of y 
## 0.5662377 0.5073637

There is a significant difference in the mean of the square root of the volatile acidity between excellent wines and both poor and average wines. However, there is no significant difference between poor and average wines. Still, there is some significance, so VA is a factor!

Citric acid is normally distributed, so the mean is an appropriate average and no transformation is necessary.

Looks like we have peaks at different levels, so let’s see if they are significantly different!

## 
##  Welch Two Sample t-test
## 
## data:  avg_wines$citric.acid and poor_wines$citric.acid
## t = 0.02026, df = 19.511, p-value = 0.984
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.03793915  0.03868214
## sample estimates:
## mean of x mean of y 
## 0.3363715 0.3360000
## 
##  Welch Two Sample t-test
## 
## data:  avg_wines$citric.acid and exc_wines$citric.acid
## t = 3.1807, df = 2759.8, p-value = 0.001486
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.003955953 0.016673831
## sample estimates:
## mean of x mean of y 
## 0.3363715 0.3260566
## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$citric.acid and exc_wines$citric.acid
## t = 0.54095, df = 19.703, p-value = 0.5946
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.02843636  0.04832315
## sample estimates:
## mean of x mean of y 
## 0.3360000 0.3260566

There is only a difference between average and excellent wines, so citric acid is a faint contender.

Residual sugar is definitely skewed. Let’s try a transform!

The sqrt transform resulted in a left skewed graph. When I did a log 10 transformation, I saw that the residual sugar is bimodal. Let’s see how the different ratings fit into that.

Poor and average wines have more wines coming in at the higher sugar peak, while excellent wines have more wines with lower sugar contents. Sounds like a contender! We can’t do a t-test since the data is bimodal.

The chloride levels are skewed, let’s transform!

Look at the beautiful normal curve. The sqrt transformation did the trick!

Looks like the peaks are at slightly different places. T-test time!

## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$trans_C and avg_wines$trans_C
## t = 0.49846, df = 19.069, p-value = 0.6239
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.02520204  0.04096329
## sample estimates:
## mean of x mean of y 
## 0.2226194 0.2147388
## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$trans_C and exc_wines$trans_C
## t = 1.8405, df = 19.103, p-value = 0.08128
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.003981942  0.062205054
## sample estimates:
## mean of x mean of y 
## 0.2226194 0.1935078
## 
##  Welch Two Sample t-test
## 
## data:  avg_wines$trans_C and exc_wines$trans_C
## t = 20.005, df = 2621.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.01914986 0.02331201
## sample estimates:
## mean of x mean of y 
## 0.2147388 0.1935078

There is a significant difference in the mean chloride levels for average and excellent wines. Like citric acid, it might be a determining factor between average and excellent wines. I also created a new variable (trans_C), which is the sqrt of the chloride amount. I will be using this value in later investigations.

Free SO2 is slightly skewed, let’s transform!

The sqrt transformation did it! Looking nice and normal. Let’s see the ratings distribution.

Average and excellent wines have about the same distributions, but the poor wines have multiple peaks. Free SO2 doesn’t look like a major determining factor for wine quality, but I might want to explore the free SO2 of poor wines with other variables. Let’s do a t-test to be sure.

## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$trans_fSO2 and avg_wines$trans_fSO2
## t = 0.52386, df = 19.028, p-value = 0.6064
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.374811  2.292887
## sample estimates:
## mean of x mean of y 
##  6.224715  5.765677
## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$trans_fSO2 and exc_wines$trans_fSO2
## t = 0.52731, df = 19.063, p-value = 0.6041
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.372183  2.296729
## sample estimates:
## mean of x mean of y 
##  6.224715  5.762442
## 
##  Welch Two Sample t-test
## 
## data:  avg_wines$trans_fSO2 and exc_wines$trans_fSO2
## t = 0.075356, df = 2111.9, p-value = 0.9399
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.08096016  0.08743069
## sample estimates:
## mean of x mean of y 
##  5.765677  5.762442

There is no significant difference in the sqrt of the free SO2 levels in wine. NOT a contender!

Total SO2 is skewed, so let’s transform it!

Much better! A sqrt transform was the answer. Let’s check out the ratings.

Whoa, it looks like poor wines have more total SO2 than average and excellent wines. Let’s do a t-test and see!

## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$trans_TSO2 and avg_wines$trans_TSO2
## t = 0.68249, df = 19.04, p-value = 0.5031
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.327322  2.612061
## sample estimates:
## mean of x mean of y 
##  12.40105  11.75868
## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$trans_TSO2 and exc_wines$trans_TSO2
## t = 1.3856, df = 19.086, p-value = 0.1818
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6655997  3.2755208
## sample estimates:
## mean of x mean of y 
##  12.40105  11.09609
## 
##  Welch Two Sample t-test
## 
## data:  avg_wines$trans_TSO2 and exc_wines$trans_TSO2
## t = 12.226, df = 2146.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.5563108 0.7688717
## sample estimates:
## mean of x mean of y 
##  11.75868  11.09609

Well, that wasn’t what I expected! There is a significant difference between the average and excellent means of the sqrt of total SO2 levels, but no significant difference between poor wines and the other ratings. Still a contender.

Density is slightly skewed, but it is so tightly clustered I don’t think there will be much difference between the ratings. I’m not going to transform this one due to the tightness of the data. Let’s check the ratings.

Hmmm, there are some different peaks, let’s do some t.tests for kicks and giggles.

## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$density and avg_wines$density
## t = 0.66862, df = 19.196, p-value = 0.5117
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.0009029601  0.0017515312
## sample estimates:
## mean of x mean of y 
## 0.9948840 0.9944597
## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$density and exc_wines$density
## t = 3.8708, df = 19.694, p-value = 0.0009742
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.001138508 0.003805530
## sample estimates:
## mean of x mean of y 
##  0.994884  0.992412
## 
##  Welch Two Sample t-test
## 
## data:  avg_wines$density and exc_wines$density
## t = 21.226, df = 1708.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.001858512 0.002236955
## sample estimates:
## mean of x mean of y 
## 0.9944597 0.9924120

Spoke too soon: there are significant differences between the mean density of excellent wines and that of poor or average wines. No difference between poor and average, though.

The pH is almost normally distributed. Let’s check those ratings!

It looks like poor wine has a slightly higher (less acidic) pH than the other ratings. T-tests to confirm!

## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$pH and avg_wines$pH
## t = 0.14352, df = 19.099, p-value = 0.8874
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.09155584  0.10504156
## sample estimates:
## mean of x mean of y 
##  3.187500  3.180757
## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$pH and exc_wines$pH
## t = -0.58582, df = 19.404, p-value = 0.5647
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.12621679  0.07095264
## sample estimates:
## mean of x mean of y 
##  3.187500  3.215132
## 
##  Welch Two Sample t-test
## 
## data:  avg_wines$pH and exc_wines$pH
## t = -6.3777, df = 1617.8, p-value = 2.34e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.04494674 -0.02380313
## sample estimates:
## mean of x mean of y 
##  3.180757  3.215132

Looks like there is a significant difference between the average and excellent wines, but not between poor and any other!

Sulphates are slightly right skewed, so let’s transform them!

Much better. The sqrt transformation did the trick. Let’s check out the ratings:

Looking pretty similar, but let’s do the t-tests just to be sure!

## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$trans_sulp and avg_wines$trans_sulp
## t = -0.52255, df = 19.153, p-value = 0.6073
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.05039506  0.03025001
## sample estimates:
## mean of x mean of y 
## 0.6837169 0.6937894
## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$trans_sulp and exc_wines$trans_sulp
## t = -0.9055, df = 19.812, p-value = 0.3761
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.05817715  0.02297222
## sample estimates:
## mean of x mean of y 
## 0.6837169 0.7013194
## 
##  Welch Two Sample t-test
## 
## data:  avg_wines$trans_sulp and exc_wines$trans_sulp
## t = -2.4669, df = 1484.4, p-value = 0.01374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.013517393 -0.001542489
## sample estimates:
## mean of x mean of y 
## 0.6937894 0.7013194

Glad I checked. There is a significant difference between the mean of the sqrt of the sulphate levels for average and excellent wines, but not with poor wines.

Alcohol is very right skewed, with a peak at about 9.5% by volume (mean = 10.51 and median = 10.4). I’m going to see if plotting the log10 of the alcohol content will normalize the graph.

Neither the log nor the sqrt helped. The log10() gives multiple peaks, but a generally normal shape. I’m not going to apply a transformation here, since neither one normalized the data. Let’s see if the ratings grouping of the original data has any insights.

We definitely have some different distributions. Let’s run the t-tests and find out how significant these differences are.

## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$alcohol and avg_wines$alcohol
## t = 0.29377, df = 19.161, p-value = 0.7721
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4931970  0.6543541
## sample estimates:
## mean of x mean of y 
##  10.34500  10.26442
## 
##  Welch Two Sample t-test
## 
## data:  poor_wines$alcohol and exc_wines$alcohol
## t = -3.8747, df = 19.761, p-value = 0.0009603
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.6480638 -0.4939803
## sample estimates:
## mean of x mean of y 
##  10.34500  11.41602
## 
##  Welch Two Sample t-test
## 
## data:  avg_wines$alcohol and exc_wines$alcohol
## t = -27.118, df = 1539.4, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.234898 -1.068304
## sample estimates:
## mean of x mean of y 
##  10.26442  11.41602

No significant difference between the mean alcohol content of poor and average wine, but there is a significant difference between excellent and poor or average wine.

It is clear to see that we have a lot more average wines than any other rating. At least excellent wines are visible, while poor wines just aren’t even a contender.

Univariate Analysis

What is the structure of your dataset?

wine_qual has 4898 observations and originally 12 variables. 11 of the variables are based on physicochemical tests on the wine and one (quality) is based on sensory input from wine critics. White wines can be either sweet or dry, so I wanted to separate the two groups. It turns out there is only one sweet wine (residual sugar >45 g/L), so I just excluded it from my data set. I created a categorical variable, rating, in which I assigned a rating to a group of quality scores. Poor wines scored between 0 and 3, average wines scored between 4 and 6, and excellent wines scored between 7 and 10. There are 20 poor wines, 3817 average wines, and 1060 excellent wines. I also performed transformations on the following variables to normalize their data: volatile acidity, chlorides, free sulfur dioxide, total sulfur dioxide, and sulphates all received a sqrt transformation. Residual sugar received a log10 transformation.

What is/are the main feature(s) of interest in your dataset?

I want to figure out which chemical properties lead to a higher wine rating. Pretty much every chemical property I tested yielded a statistically significant difference in mean values between average and excellent wine. The only exception was free sulfur dioxide. However, only a few yielded statistically significant differences in mean values between poor and excellent wines. These variables are fixed acidity, the sqrt of volatile acidity, density, and alcohol. I will investigate these variables further in the bivariate plots section.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Even though free sulfur dioxide didn’t yield statistically significant differences in mean values for the ratings, it has an interesting property. If the free sulfur dioxide levels are above 50 ppm, they are detectable, so it would be interesting to see how detectable free sulfur dioxide levels affect the ratings. Residual sugar also had an interesting result. When I did a logarithmic transformation on it, it turned out to be bimodal. The ratings reflected this distribution as well, but excellent wines had the larger peak at lower sugar amounts, while poor and average wines had the larger peak at higher sugar amounts.

Did you create any new variables from existing variables in the dataset?

I created a factor variable called ‘rating’ to give qualitative meaning to the ‘quality’ measure. In the next section I plan on using these ratings to see what range each rating has in each of the physicochemical properties.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I performed sqrt transformations on volatile acidity, chlorides, free and total sulfur dioxide, and sulphates. I did this transformation to normalize the data, so I can compare it to other data down the line. I also performed a log 10 transformation on residual sugar to normalize it, but I ended up with a bimodal distribution. The sqrt transformation didn’t help normalize the sugar data either, so it’ll be interesting to see what behavior shows up when comparing other variables to sugar!

Bivariate Plots Section

dry_wine_qual2 will be my main dataframe for the bivariate analysis. I deleted the original forms of the transformed variables (volatile acidity, chlorides, total sulfur dioxide, sugar, and sulphates). I also deleted free.sulfur.dioxide and its transformed column because there wasn’t a significant difference in its mean values between ratings. I also deleted the quality column because I’m going to use my rating column to perform my analysis, not the specific numerical scores.

I need to reorder my rating factors (right now they are alphabetical), and I want to take a closer look at the boxplots for quality and other variables. I don’t really need to investigate density since the values are so tightly clustered around 0.99.

No surprise that sugar and alcohol have a strong relationship with density. I’m surprised chlorides didn’t, since salt can greatly affect the density of a liquid. Most other relationships were weakly correlated. We are interested in investigating the physicochemical traits that result in a high quality rating, so let’s start with the histograms and boxplots.

## 
##   average excellent      poor 
##      3817      1060        20

I already know that the differences are significant, and it seems that the median fixed acidity levels decrease as the ratings get better. Also there are a lot of outliers.

A note on all the t-tests done in the previous sections: they have to be unpaired t-tests because the ratings are separate groups of wines, and the groups don’t even have the same number of observations. Also, there are only 20 poor wines, so the tests involving them have little power, which could be one reason no significant difference shows up between poor and average ratings.

Again, I know that the difference between the means is statistically significant. The medians don’t show a particular pattern, but looking at the table of means, the mean value of the sqrt of the volatile acidity decreases as the rating improves.

There are quite a few outliers in the average and excellent ratings. I already know the difference of the means is significant, so it appears that as the rating improves the density decreases.

There are a few outliers for the average rating, but it is clear that the higher rating has the higher alcohol content.

##     ratings FA_means TVA_means   D_means AL_means
## 1      poor 7.600000 0.5662377 0.9948840 10.34500
## 2   average 6.886639 0.5228504 0.9944597 10.26442
## 3 excellent 6.725142 0.5073637 0.9924120 11.41602

The t-tests for these four variables reveal that the differences in their mean values are significant between excellent wines and the other two ratings.

Findings:

* Fixed acidity (FA) decreases as rating increases.
* The sqrt of the volatile acidity decreases as rating increases.
* Density decreases as rating increases.
* Alcohol content is about the same for poor and average wines, but it is higher for excellent wines.

Looking forward to exploring these variables in more depth in the next section! I only explored these variables because they allowed us to reject the null hypothesis of no difference between the means of the values for each rating.
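A table of means like the one above boils down to grouping rows by rating and averaging a column. Here is a minimal Python sketch of that step. The helper `mean_by_rating`, the `RATING_ORDER` list, and the sample rows are hypothetical, and the alcohol values are invented round numbers rather than real observations.

```python
from collections import defaultdict
from statistics import fmean

# Ratings listed from worst to best, instead of the default alphabetical order.
RATING_ORDER = ["poor", "average", "excellent"]


def mean_by_rating(rows, column):
    """Return {rating: mean of column}, ordered from poor to excellent."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["rating"]].append(row[column])
    return {r: fmean(groups[r]) for r in RATING_ORDER if r in groups}


# Invented rows, only to show the grouping.
rows = [
    {"rating": "average", "alcohol": 9.0},
    {"rating": "poor", "alcohol": 10.0},
    {"rating": "excellent", "alcohol": 12.0},
    {"rating": "average", "alcohol": 10.0},
    {"rating": "poor", "alcohol": 11.0},
    {"rating": "average", "alcohol": 11.0},
]
alcohol_means = mean_by_rating(rows, "alcohol")
print(alcohol_means)
```

Fixing the order in one place also takes care of the alphabetical-factor problem mentioned earlier, since every table and plot can reuse the same list.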

Let’s now take a look at sugar. I want to explore sugar because it had a bimodal distribution overall and for each rating. The excellent wines had a higher peak at lower sugar levels than poor or average wines.

Sugar also has a strong relationship with density and a weak relationship with total sulfur dioxide.

## [1] "poor wine log10 sugar summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1549  0.2007  0.6628  0.6233  1.0293  1.2095
## [1] "average wine log10 sugar summary"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2218  0.2304  0.7782  0.6615  1.0170  1.4997
## [1] "excellent wine log10 sugar summary"
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.09691  0.25527  0.58826  0.57646  0.86923  1.28443

There is no clear pattern here. Average wines have a higher mean than both poor and excellent wines, but excellent wines have a lower mean than poor wines. Maybe some patterns will emerge when sugar is compared to other values.

Here we can see that excellent wines have their higher peak at a lower sugar level, while average and poor wines have their higher peak at higher sugar levels. I would say sugar level isn’t a major determining factor, since all three ratings share peaks at about the same place, but it is clear that more excellent wines have a lower sugar amount than average or poor wines.

The transformed sugar and density have a strong correlation. As the sugar increases, so does the density. That makes sense since density is derived from the components in the liquid. There is a bit of an oddball point out there. I definitely want to add ratings to this chart to see if there are more patterns.

As the amount of total sulfur dioxide increases, the amount of residual sugar increases as well. Can’t wait to see what the ratings will tell us! Note: I used dry_wine_qual because I hadn’t added the transformed sugar variable to dry_wine_qual2.

Total sulfur dioxide and density have a moderate relationship, as shown in the plot. As the amount of total sulfur dioxide increases, the density increases.

As the alcohol content increases, the density decreases. I chose to look at this graph because it has one of the strongest correlations in the ggpairs plot. I definitely want to look at this plot with ratings and see if there are any patterns.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I investigated the four chemical properties that displayed statistically significant differences in their mean values between the ratings. These chemical properties are the fixed acidity, the square root of the volatile acidity, the density, and the alcohol content. I found that the mean value for all of them except alcohol content decreases as the rating increases. For alcohol content, the mean increases as the rating increases.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I also looked at the relationships that displayed some amount of correlation on the ggpairs summary. I looked at density against alcohol, sugar, and the square root of the total amount of sulfur dioxide. The density decreases when alcohol increases, and it increases when sugar or total sulfur dioxide increases. Another relationship I looked at was the sqrt of total sulfur dioxide and sugar. The data points seem to be in two big clusters, so I really want to add ratings to that graph to see if any patterns show up.

What was the strongest relationship you found?

Density was strongly related to sugar and alcohol. This isn’t surprising since the density of a liquid is determined by the components in the liquid. More sugar goes with a higher density, which makes sense since the density of sugar is greater than 1 g/cm^3. More alcohol goes with a lower density, which also makes sense since the density of alcohol is less than 1 g/cm^3.

Multivariate Plots Section

The sqrt of free sulfur dioxide didn’t yield any significant differences between its mean values for the different ratings. I will use it to build plots with my four variables of interest (fixed acidity, sqrt of volatile acidity, density, and alcohol) and add rating as a color to see if any patterns arise.

The vertical red line represents the detectable threshold for free sulfur dioxide. The horizontal line represents the mean of the variable on the y-axis. I chose to smooth the average and the excellent ratings because they have so many more points than the poor rating that it was impossible to see the poor ratings at all. Also, the poor ratings were just all over the map, so there really isn’t any clear relationship between our four variables and a poor rating.

The black lines represent the means for each variable. Excellent wines tend to have higher than average alcohol content and lower than average densities.

Excellent wines tend NOT to have higher than average volatile acidity and higher than average densities.

The black lines represent the averages for each value. Excellent wines tend to have lower than average density and fixed acidity.

Excellent wines have higher than average alcohol content. No conclusion for fixed acidity though.

As the alcohol content of excellent wines increases, the amount of volatile acid increases as well.

No clear patterns here.

Findings so far: excellent wines have lower than average densities and higher than average alcohol content. These are great features to know for excellent wine because you can find them on the label (density is mass / volume). The other two variables didn’t provide many insights.

Let’s check out our other bivariate graphs that we wanted to add another layer to!

So, as expected, the higher alcohol content corresponds to the lower densities. The red line represents the trend for excellent wines, and the yellow line represents the trend for average wines. At lower sugar levels, excellent wines tend to have more alcohol and less density than average wines. However, at higher sugar levels (about 15 g/dm^3) they follow the same trend.

I was hoping for more. There are no clusters for ratings, alcohol or density, so I settled on looking at the trends for excellent and average wines. They both generally follow the same curve. No insights here.

I already knew that excellent wines tend to have a lower density. It also seems like excellent wines tend to have lower than average levels of total sulfur dioxide.

Because both volatile acidity and total sulfur dioxide had a sqrt transformation, I can compare the original values and get the same relationship. I wanted to look at this relationship because I was surprised that volatile acidity wasn’t more of a factor in determining the quality of wine. Volatile acidity is the measure of wine spoilage (http://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity), and is a byproduct of microbial metabolism. Sulfur dioxide is an antimicrobial additive to wine, and it can reduce the amount of volatile acidity in wine. Interestingly enough, it seems like the sulfur dioxide is doing a more effective job in excellent wines (VA decreases with more SO2), and not so much in average and poor wines (the VA increases as the SO2 increases!). I need to calculate the correlations for the ratings to see if this is a strong, moderate, or weak relationship.

## [1] "r values for poor wines, avg_wines, and excellent wines respectively"
## [1]  0.2410981  0.1102411 -0.1000730

Okay, so these are weak correlations at best. This just means that a linear fit is not a good choice for a model. I’m still going to go off the trends because I think this is an interesting find!
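To make "strong, moderate, or weak" concrete, here is a small Python sketch. The pearson_r function is the standard Pearson formula written out by hand; the strength function and its cutoffs (0.3 and 0.7) are only a common rule of thumb that I am assuming here, not something defined by the dataset or by R. The three r values are the ones printed above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def strength(r, weak_below=0.3, strong_from=0.7):
    """Label |r| with assumed rule-of-thumb cutoffs (hypothetical defaults)."""
    size = abs(r)
    if size < weak_below:
        return "weak"
    if size < strong_from:
        return "moderate"
    return "strong"


# r values reported above for volatile acidity vs. total sulfur dioxide.
reported_r = {"poor": 0.2410981, "average": 0.1102411, "excellent": -0.1000730}
labels = {rating: strength(r) for rating, r in reported_r.items()}
print(labels)
```

Under these assumed cutoffs all three ratings land in the weak band, and only the sign differs between excellent wines and the other two groups.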

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There was only one sweet wine, so I only compared dry wines (residual sugar <45 g/L). The main features of interest that I investigated were the fixed acidity, the sqrt of the volatile acidity, the density, and the alcohol content.

Initially I had rejected density as a factor because it was so tightly clustered, but it turns out that density is a clear determining factor for excellent wines. The alcohol content also gave clear clusterings for the ratings. Excellent wines tend to have lower density and higher alcohol content than average wines. Fixed acidity and volatile acidity didn’t yield any decisive conclusions like density and alcohol did.

Were there any interesting or surprising interactions between features?

I was surprised that high volatile acid levels didn’t really affect the ratings. Volatile acid is acetic acid, the same type of acid that is in vinegar. I would have expected larger amounts of VA to decrease the quality of the wine. After doing a bit of research (http://waterhouse.ucdavis.edu/whats-in-wine), I learned that mixing in sulfur dioxide is a way to decrease the amount of volatile acid. When we plot volatile acid versus total sulfur dioxide, it turns out excellent wines have lower VA amounts with higher SO2 levels, but poor and average wines do the opposite! Their VA amounts increase as their SO2 levels increase.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Nope! No time! And the correlations between variables were not strong enough.

Final Plots and Summary

Plot One

Description One

I chose to look at a density plot instead of a histogram because the sample sizes for each rating are not even. There are 20 poor wines, 3817 average wines, and 1060 excellent wines. By doing a density plot, we can compare each group on an equal level. When I was trying to normalize the residual sugar histogram, I realized that it became bimodal under the log10 transformation. In the above plot we can see that each rating’s density curve is also bimodal, but excellent wines have a higher peak at the lower sugar level, while poor and average wines have higher peaks at higher sugar levels.
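Here is a rough Python sketch of why density scaling puts unevenly sized groups on an equal footing. The group sizes (20, 3817, 1060) come from the text above; the way each group is split across three bins and the bin width are made up purely for illustration. A real density plot uses a smooth kernel estimate rather than bins, but the scaling idea is the same.

```python
def to_density(counts, bin_width):
    """Scale raw bin counts so the bars of one group have total area 1."""
    total = sum(counts)
    return [count / (total * bin_width) for count in counts]


BIN_WIDTH = 0.5  # hypothetical bin width, in log10(g/dm^3)

# Group totals match the text; the split across bins is invented.
counts = {
    "poor": [6, 4, 10],
    "average": [1500, 900, 1417],
    "excellent": [500, 300, 260],
}

densities = {
    rating: to_density(group_counts, BIN_WIDTH)
    for rating, group_counts in counts.items()
}

for rating, values in densities.items():
    print(rating, [round(v, 3) for v in values])
```

With raw counts, the 20 poor wines would be invisible next to 3817 average wines. After scaling, every group's bars cover the same total area, so only the shapes are compared.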

Plot Two

Description Two

I had begun to think that there wasn’t really any one factor that affected the rating of a wine until I got this graph. From this graph, it is clear that the density of the wine decreases as the alcohol content increases. All three ratings show that relationship. But we can see a clustering of the excellent ratings at the higher alcohol contents. From this graph, we can say that excellent wines tend to have higher alcohol content.

Plot Three

Description Three

Alcohol and sugar had the strongest correlations with density in our dataset. We can clearly see two bands of color going through the plot: light blue, representing high alcohol content, sweeping through at lower densities, and dark blue, representing low alcohol content, sweeping up through the higher densities. This confirms the finding from plot two that higher alcohol content wines have lower densities. We can also see that the density of the wine increases as the amount of sugar increases. When I plotted the linear smooths for average and excellent wines, it was interesting to see that excellent wines lie in the high alcohol content band while average wines lie in the low alcohol content band. This leads me to conclude that excellent wines tend to have higher alcohol content and lower densities than average wines.

Reflection

This project taught me a lot about data exploration. I realize that having a little bit of knowledge about your data subject is super helpful. I have pretty much no experience with white wine, so I went into this exploration without any preconceived ideas on what makes an excellent wine. I do know science, and I have also tasted really acidic, or salty, or sweet foods, and I know that tolerance and taste preferences play a huge part in how a food gets rated. I was glad that each wine was rated by at least three different critics so each quality score was an average, not just one person’s opinion. Because I was trying to answer a question that is inherently subjective, I found it really difficult to pinpoint exactly which factors answer that question. I was able to narrow it down from 10 to four after the Univariate analysis. Then I narrowed it down to two, alcohol content and density. Residual sugar was definitely of interest, since the sugar level is what determines if a wine is sweet or dry. There was only one sweet (sugar >45 g/L) wine, so I only did my analysis on the dry whites. I know that people prefer certain levels of sweetness for their wines, so I wasn’t surprised that it wasn’t a decisive factor like density or alcohol.

Another interesting factor was the free sulfur dioxide (SO2) level. SO2 is an antimicrobial added to wine to preserve it. It is mainly used to prevent the buildup of volatile acidity (acetic acid) in the wine. Free SO2 levels are generally undetectable, but once they exceed 50 ppm (or 50 mg/dm^3 to keep it within context of the given units), they are detectable by taste and smell. While SO2 wasn’t really a determining factor for quality, it showed some interesting behavior. Excellent wines showed decreased levels of volatile acidity for increased levels of SO2, while average and poor wines showed increased levels of volatile acid for increased levels of SO2! Seems like the mix in excellent wines allows the SO2 to do its thing!

Overall I felt that since I was in the dark about the subject matter, I wanted to explore everything! I love the ggpairs function because it allowed me to take a snapshot of different pairs and decide which ones I really wanted to play with. I was honestly shocked that the quality of the wine seems to boil down to alcohol percentage and density. But in doing some research for this project at my local supermarket, I realized that they are the only things you can know about a wine from the label. So even if I had found another relationship, I wouldn’t be able to use it without a lab! But on the flip side, since I wasn’t seeing any relationships I felt like I was going down a rabbit hole with no end in sight! But that is the nature of EDA.

While I prefer python for Data Wrangling, R is SO much nicer for visualizations. I love adding layers, and figuring out how to incorporate different factors in different ways was really fun. I love that I can have four different factors on one graph (one per axis, one for color, and one for size). Talk about powerful information!

If I had more time to really go down the rabbit hole, I would want to try to build a model for wine quality. It was fascinating to see how transforming one factor by taking a square root or log could change the correlations. Before transforming some variables, I barely saw any correlation between the variables, but the transformations really helped bring some patterns to light.

So next time I’m choosing a white wine from this brand, I stand a good chance of choosing an excellent wine if it has an alcohol content greater than 10.5% and a density less than 0.994 g/cm^3.
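That rule of thumb is easy to write down as a tiny Python check. The two thresholds come straight from the sentence above; the function name and the three bottles are hypothetical, just to show how the rule would be applied.

```python
ALCOHOL_MIN = 10.5   # % by volume, threshold from the conclusion above
DENSITY_MAX = 0.994  # g/cm^3, threshold from the conclusion above


def likely_excellent(alcohol, density):
    """Apply the two-threshold rule of thumb from this analysis."""
    return alcohol > ALCOHOL_MIN and density < DENSITY_MAX


# Hypothetical bottles as (name, alcohol %, density g/cm^3); not real wines.
shelf = [
    ("bottle A", 11.2, 0.9921),
    ("bottle B", 9.4, 0.9968),
    ("bottle C", 12.0, 0.9952),
]

picks = [name for name, alcohol, density in shelf if likely_excellent(alcohol, density)]
print(picks)
```

Both conditions have to hold, so a high-alcohol bottle with a density above the cutoff (like bottle C here) does not make the list. This is a screening heuristic based on the trends in this dataset, not a fitted model.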